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ABSTRACT 


We study parallel two-level overlapping Schwarz algorithms for solving nonlinear finite element 
problems, in particular, for the full potential equation of aerodynamics discretized in two dimensions 
with bilinear elements. The overall algorithm, Newton-Krylov-Schwarz (NKS), employs an inexact 
finite-difference Newton method and a Krylov space iterative method, with a two-level overlapping 
Schwarz method as a preconditioner. We demonstrate that NKS, combined with a density upwinding 
continuation strategy for problems with weak shocks, is robust and economical for this class of 
mixed elliptic-hyperbolic nonlinear partial differential equations, with proper specification of several 
parameters. We study upwinding parameters, inner convergence tolerance, coarse grid density, 
subdomain overlap, and the level of fill-in in the incomplete factorization, and report their effect on 
numerical convergence rate, overall execution time, and parallel efficiency on a distributed-memory 
parallel computer. 


^his author’s work was supported in part by NSF grants ASC-9457534, ASC-9217394, and ECS-9527169, 
by NASA grant NAG5-2218, and by NASA contract NAS1-19480 while the author was in residence at the 
Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, 
Hampton, VA 23681-0001. 

2 This author’s work was supported by the Office of Scientific Computing, U.S. Department of Energy, 
under Contract W-31-109-Eng-38. 

3 This author’s work was supported in part by NSF grants ECS-8957475 and ECS-9527169, and by NASA 
contract NASI- 19480 while the author was in residence at the Institute for Computer Applications in Science 
and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681-0001. 


I 





1 Introduction 


In the past few years domain decomposition methods for linear partial differential equations, 
including overlapping Schwarz methods [9, 12, 13, 37], have graduated from theory into 
practice in many applications [17, 27, 28, 34]. In this paper, we study several aspects of the 
parallel implementation of a Krylov-Schwarz domain decomposition algorithm for the finite 
element solution of the nonlinear full potential equation of aerodynamics, extending our 
model studies of linear convection-diffusion problems in [5] and of linear aerodynamic design 
optimization problems in [33]. Newton-Krylov methods [2, 3, 14, 15, 39] are potentially well 
suited and increasingly popular for the implicit solution of nonlinear problems whenever it 
is expensive to compute or store a true Jacobian. We employ a combined algorithm, called 
Newton-Krylov-Schwarz, and focus on the interplay of the three nested components of the 
algorithm, since the amount of work done in each component affects and is affected by the 
work done in the others. 

Newton-Krylov-Schwarz is a general purpose parallel solver for nonlinear partial differ- 
ential equations and has been applied to complex multicomponent systems of compressible 
and reacting flows in, e.g., [6, 7, 29]. This paper is concerned with the simpler scalar problem 
of the full potential equation, which describes inviscid, irrotational, isentropic compressible 
flow. Though the full potential model is highly idealized, it remains the model of choice 
of external aerodynamic designers to date, because codes based thereupon offer reasonable 
turnaround times and in many cases high accuracy compared to state-of-the-art Navier- 
Stokes solvers. Though derived under the condition of isentropy, the full potential model 
remains useful in flows with weak shocks, with pre-shock Mach numbers of about 1.2 or 
less. It can also be extended by boundary layer patching to incorporate viscous effects, by 
a branch cut to accommodate lift, and by source terms to simulate powered engines. In 
engineering practice, accurately modeling such nonideal effects in complex geometries ac- 
counts for almost all of the lines of code, but the solution of the resulting discrete equations 
accounts for the majority of the execution time. The lower per-cell storage and computa- 
tional requirements of the potential model allow the use of grids dense enough to achieve low 
truncation error levels for complex geometries. The full potential equation also avoids the 
spurious entropy generation near stagnation often associated with Euler and Navier-Stokes 
codes for industrial complex geometries of interest. We justify the simply coded examples 
in this paper by our focus on a solution algorithm that should not require any changes other 
than greater irregularity in its sparse data structures to be useful in more practical settings. 

With Newton’s method as the outer iteration, a highly nonsymmetric and/or indefinite 
large, sparse Jacobian equation needs to be solved at every iteration to a certain accu- 
racy, which is often progressively tightened in response to a falling nonlinear residual norm. 
The most popular family of preconditioners for large sparse Jacobians on structured or un- 
structured grids, incomplete factorization, is difficult to parallelize efficiently flop-for-flop in 
its global form. In our approach, the ILU- preconditioner for the Newton correction equa- 
tions is replaced by a multi-level overlapping Schwarz preconditioner. The latter is not 
only scalably parallelizable (up to available granularities), but also possesses an asymptoti- 
cally optimal mesh- and granularity-independent convergence rate for elliptically dominated 
problems. Our two-level overlapping additive Schwarz algorithm uses a non-nested coarse 
space. Subdomain granularity, quality of subdomain solves, coarse grid density, strategy for 
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coarse grid solution, and inner iteration termination criteria are important factors in overall 
performance. We report numerical experiments on an IBM SP2 with up to 32 processors. 

The outline of this paper is as follows. In §2, we briefly derive the form of the full 
potential equation that serves as the point of departure for the numerics. The finite ele- 
ment discretization and the construction of an approximate Jacobian for the full potential 
equation are discussed in §3. §4 is devoted to the description of the basic components of 
the NKS algorithm. Several parallel implementation issues are explained in §5. Numerical 
results are summarized in §6. Finally, we offer some general remarks on the use of NKS 
algorithms in §7. 


2 The full potential problem 


For completeness, we summarize the derivation and assumptions of the full potential equa- 
tion of aerodynamics. For a more thorough development, see [23]. 

The equation of mass conservation in a steady state fluid flow can be written in diver- 
gence form, 

V ■ (pv) = 0, (1) 


where v = (v\,V 2 ) T is the velocity and p is the local density, respectively. We assume that 
the flow is irrotational, which implies that there exists a velocity potential 4> such that 
v = V$. Furthermore, the relation ^ = const . holds for isentropic flow of a perfect gas. 
With the above assumptions, we can integrate the inviscid momentum equations and obtain 
Bernoulli’s equation 


q * a* 

1 = const.. 

2 7-1 


( 2 ) 


where q = (v% + v |) 1//2 = ||V $||2 is the local flow speed. The sound speed a is defined by 
a 2 = dp/dp , where p is the local static pressure. By means of the above relations, the five 
unknown fields t?i, V 2 ,p,a, and p can be eliminated in favor of a single unknown function 
which solves the full potential equation: 


V • (/>($)V$) = 0. 


(3) 


Two forms of this equation are standard in the literature, depending upon whether the 
density is referenced to a uniform freestream (at oo) or to a stagnation point condition. We 
derive the freestream version as follows. From Bernoulli’s equation (2). 


we have that 


a 2 _ <4 , <4 

2 7-1 2 7-T 
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(5) 


where M ~ qj a is the Mach number, and M ^ is the freestream Mach number. From the 
definition of the sound speed and the pressure-density relation, we obtain 


a 


2 


<Usf£) 

dp 


'•y — 1 

= np , 


( 6 ) 
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or equivalently, (^_) 2 = (^) 7 ’. Therefore, 




p($) = Poo 




llo ) 


(7) 


Observe that while the density is positive in regions of validity, (3) may be locally hyperbolic. 

Equation (3) requires boundary conditions. In this paper, we consider only subsonic 
farfield boundaries. Since our emphasis is on the performance of the Schwarz precondi- 
tioning, we study a symmetric nonlifting case, thus avoiding consideration of the Kutta- 
Joukowsky boundary condition. To keep the geometry of the domain trivial, we use a 
classical transpiration boundary condition on a slit to represent the airfoil. Transpiration 
refers to a continuously parameterized injection and removal of fluid along a portion of 
the boundary to create a recirculation pocket with a bounding streamline attached to the 
domain boundary at both ends, over which the flow of interest passes inviscidly. Transpira- 
tion is implemented as an inhomogeneous Neumann condition. A theoretical discussion of 
the use of transpiration boundary conditions to model displaced surfaces can be found in 
[25]. For the farfield boundary condition, we use Dirichlet values of the potential upstream. 
More sophisticated farfield conditions are possible, and are required in the case of a lifting 
airfoil, but these conditions are sufficient for excellent agreement of our numerical results 
with standard nonlifting solutions. 


3 Finite element approximation 

Following Boeing’s TRANAIR code [42], we employ a finite element formulation of the 
two-dimensional full potential equation using bilinear elements. The existence, uniqueness, 
and regularity of the solution are not central to this paper, but have been discussed in the 
papers [31, 35] and references therein. Related finite element approaches for this class of 
full potential equations can also be found in [1, 16]. A finite volume scheme was given in 
[30] and a mortar element-based domain decomposition scheme was recently parallelized to 
high efficiency in [26]. 

3.1 Basic finite element scheme 

The finite element problem is formulated in terms of the weak form 

a($, v) — f p(<I>)V<I> * Vv cfft. 

Ju 

We use bilinear elements on a rectangular partition of ft denoted by ft/> = {r^, i = 1, . . M/J. 
Let {</>,■(£, ?/)} be the usual nodal basis functions. The numerical solution we seek has the 
form 

$,<&(*, y), 

and satisfies the following nonlinear algebraic equations 

Mh 

pmxivi)) 

i=i 



L 


V4>Vt> dSl = 0, 


( 8 ) 


3 



for all v in the test function space. Here is the center point of the rectangle r t . 

To simplify and speed up numerical integration, we introduce certain approximations when 
dealing with some of the nonlinear forms. The way that we treat the local nonlinear 
numerical integration in (8) is like that in [42]. Let us define a system of nonlinear equations 


F($)= i =0, (9) 


where 


Fi($,p($)) = £ *K*(*r» 
refih 





( 10 ) 


and ( x % , y c T ) is the center point 
of the nonlinear system F = 
define , v 

djm _ 

d$j 


of the element r. We construct the Jacobian matrix J = { J z] } 
0, approximately, as follows. For each pair of indices i,j, we 


.,-n Jr 




+ 


renh T 3 

where s = || V4>||2- To simplify the numerical integration, the exact Jacobian value above 
is replaced by 

Ji j= £ pmx c T ,y c T )) 

ren h 

( 11 ) 

£ tJ W awnl){ ™ VMda ’ 

where the value of dp/ds is calculated at the element center point. We remark here that, 
because the density function p is not a constant, the Jacobian matrix is generally non- 
symmetric and possibly indefinite. The explicit construction of the Jacobian matrix is not 
necessary if we use an unpreconditioned Newton-Krylov method; however, to implement a 
Schwarz preconditioner, explicit approximation of the Jacobian matrix is needed in each 
subdomain. 


/< 


V <p : ■ V<(>i)d€t + 


3.2 Density upwinding schemes 

For subsonic problems, the above mentioned finite element method is sufficient; however, 
for transonic cases upwinding has to be introduced in the density calculation in order to 
capture the weak shock in the solution. The proper use of an upwinding scheme is essential 
both to the success of the overall approach in finding the correct location and strength of 
the shock and to the convergence, or the fast convergence, of the inexact Newton’s method. 

As mentioned earlier, density p is assumed to be a constant in each element, and this 
constant is ordinarily determined by the four values of $ at the corners of the element, 
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Figure 1: The finite element stencil. 4> is stored at the cell vertices and p at the cell centers. 

through (7). Following [24, 42], if an element is determined to be supersonic, or nearly so, 
its density value is replaced by 

P = P- P v *V_p, (12) 

where V is the normalized element velocity and V_p is an upwind undivided difference. 
For example, with reference to Fig. 1, if V = (V Xy V y ) and V x ,V y > 0 in the element marked 
with pi, then 

Pi = P\ - p(V x (pi - pe) + Vy( Pl - ps )). 

Here p is the element switching function, 

p — vq max{0, 1 - Af c 2 /M 2 }, (13) 

where M is the element Mach number, M c is a pre-selected cutoff Mach number chosen to 
introduce dissipation just below Mach 1.0, and is a constant usually set to something 
between 1.0 and 3.0 to increase the amount dissipation in the supersonic elements. Pa- 
rameters M c and uo may be varied to advantage between Newton steps in problems with 
shocks. Roughly speaking, M c controls the spatial extent of the upwinding; as it drops 
below 1.0 upwinding is triggered in a greater number of subsonic (but nearly sonic) cells. 
M c , vq , and V together control the amount of the upwinding in a triggered cell. A low 
M c and a high uo stabilize convergence but diffuse the shock. As iterations progress, M c 
should approach 1.0 and v o should be decreased to steepen up a shock whose location and 
strength has converged. A well resolved shock will take many Newton steps to settle on its 
correct location, whereas a diffuse shock centers quickly on this location. A carefully chosen 
sequence of M c and i/q can considerably accelerate the Newton’s convergence; more details 
on this “viscosity damping” can be found in [41]. Another way to control the convergence 
of Newton’s method is through the use of iterated maximization of the switching function, 
as described below and apparently first discussed in [22]. 

For each element, p as defined by (13) is called the zeroth level switching function and 
is denoted more precisely as p(°\ In Fig. 1, there is a nonzero p^ for each element marked 
with pi. The first level switching function for the element marked with p\ is defined as 

p[ l) = max{/4 0) ,-.-,/4 0) }> 
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namely, is the maximum of the all the n values in its immediate neighborhood. A 
(k + l)-level switching function is defined recursively as the maximum of the neighboring 
fc-level // values. 

Results for k = 2 are reported in §6. A rather tight Mach cutoff is used, namely 

= 0.95, and we set vq to 1.0. 

We remark that large k results in greater discrete data dependency, or larger effective 
stencil size, in both the nonlinear function and the Jacobian. For example, if k = 0, the 
stencil contains at most 9 points (e.g., the nine mesh points immediately surrounding (i,j) 
in Fig. 1). If k = 1, then some of the “x” points may join the stencil depending the 
flow direction, and the stencil may contain as many as 16 points. The increase in the 
stencil bandwidth does not cause much of a problem in the nonlinear function evaluation, 
but would substantially increase the memory requirement of the Jacobian matrix, which 
is constructed and stored for the Schwarz preconditioner at the beginning of each Newton 
iteration. To keep the memory requirements small in practice, we do not calculate or store 
the matrix elements introduced by using the iterated switching function. Our numerical 
experiments show that this extra level of approximation of the Jacobian matrix does not, in 
fact, appreciably reduce its power as a preconditioner. This is analogous to the practice in 
[7] of using Jacobian blocks based on first-order upwinding to drive a second-order upwinded 
residual to zero, in an inexact Newton iteration. Though not much discussed in the theory of 
approximate Newton methods for systems arising from PDEs, such techniques are commonly 
applied in stationary iterations in steady-state aerodynamics codes. Especially in three 
space dimensions, using simplified upwinding in the Jacobian matrix dramatically reduces 
cost at a small expense in convergence rate degradation. 

4 Newton-Krylov-Schwarz algorithms 

NKS is a family of general purpose algorithms for solving nonlinear boundary value prob- 
lems of partial differential equations. In terms of software development, NKS has three 
components that can be handled independently. However, to achieve reasonable overall 
convergence, the three components have to be tuned simultaneously. We discuss these 
components in turn. 

4.1 The matrix-free Newton method 

In this subsection, we briefly discuss the well known matrix-free inexact finite-difference 
Newton algorithm, and the Eisenstat- Walker forcing functions [15]. Starting from an initial 
guess which is sufficiently close to the solution, a solution of the nonlinear system (9) 
is sought by using an inexact Newton method: For some 77^ £ [0, 1 ) find s k that satisfies 

\\F(* k ) + J{9 k )s k \\<Tik (14) 

and set = $* + A k$k, where Xk € (0,1) is determined by a line search procedure 

[11]. In practice, the method is insensitive to the details of the method used to determine 
Afc. Much more important is nonlinear continuation in grid density, dissipation, and other 
parameters. The iteration is continued until convergence, typically defined in terms of a 
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sufficiently small ||F($fc)||- The vector s*. is obtained by approximately solving the linear 
Jacobian system 

J(*k)s k = ~F(*k) 

with a Krylov space iterative method. The action of Jacobian J on an arbitrary Krylov 
vector w can be approximated by 

Finite-differencing with e makes such matrix-free methods potentially more susceptible to 
finite word-length effects than ordinary Krylov methods. Left preconditioning of the Jaco- 
bian with an operator B -1 can be accommodated via 

B~ l J{* k )w » 1 (B~ l F((Z k + ew)) - F(**)) , 

where F($jt) = B~ l F($k) is stored once, and right preconditioning via 

J{* k )B- l w » 1 (F(($fc + tB~ l w)) - F{* k j) . (15) 

Right preconditioning is preferable when the focus is on comparing different preconditioners 
in vitro , since the true linear residual norm that is measured as a by-product in Krylov 
method GMRES (see next subsection) and used in the termination test is independent 
of any right preconditioning. On the other hand, any left preconditioning changes this 
by-product residual norm. For this very reason, left preconditioning may be preferable 
when GMRES is applied in vivo as the solver for an inexact Newton method. When the 
preconditioning B~ l is of high quality, the left-preconditioned residual serves as an estimate 
of the error in the Newton update vector. This estimate can be employed in a termination 
condition. In this paper, one of our emphases is assessing preconditioner quality, and we 
report only right-preconditioned results. 

The most expensive component of the algorithm is the solution of the linear system with 
the Jacobian at each Newton iteration. As discussed in Eisenstat and Walker [15], when 
is far from the solution, the local linear model used in deriving the Newton method may 
disagree considerably with the nonlinear function itself, and it is unproductive to “over- 
solve” these linear systems. We tested several stopping conditions, including those discussed 
in [15], and found that the best choice for our problems, based on elapsed execution time for 
a fixed relative nonlinear residual norm reduction, is simply to set 77 * = 10 “ 2 ||jF($A;)|| 2 - In 
fact, even the looser rjk = 10” 1 ||i' 1 ($jt)||2 is sufficient for the first few Newton iterations, but 
not much time is saved by switching dynamically among these two already loose criteria, 
so we use the first throughout. 

4.2 Krylov iterative methods 

We use the GMRES method [36], to solve the linear system of algebraic equations 

Px = 6, (16) 


7 



where P is the matrix appears in (15), and b is the negative of the nonlinear Newton 
residual vector in (9). The method begins with an initial approximate solution x 0 € R n and 
an initial residual r 0 = b - Px 0 . At the m th iteration, a correction vector z m is computed 
in the Krylov subspace 

fC m (r 0 ) = span{r 0 ,Pro,---,/ >m_1 r 0 } 

that minimizes the residual, min 2€ ,t m ( ro ) P - P{ x o + z)h- The m th iterate is thus x m = 
x 0 + z m- To fit the available memory, one is sometimes forced to use the Ar-step restarted 
GMRES method [36]. However, in this case neither an optimal convergence property nor 
even convergence is guaranteed. In our experiments, we do not need to solve the linear 
systems very accurately; i.e., rj = 10~ 2 in 

\\b- Px m \\ a < T)\\r 0 \\ 2 

is sufficient to capture an accurate solution to the nonlinear problem, in both subsonic and 
transonic cases. We do observe that, for certain maximum Krylov subspace dimensions 
(for example 30. in a problem with approximately 10 4 times as many discrete unknowns) 
and certain Mach numbers ( M = 0.8), the restarted GMRES can never reduce the initial 
residual below 10 -5 . In other words, there is no linear convergence. It is further noticed 
in such cases that the residual norm measured as a by-product in GMRES is no longer the 
same as, or even close to, the true residual norm except at the restarting points, where it 
is freshly updated. 4 A loose linear convergence tolerance avoids this problem by returning 
to the Newton method with a step that is far from exact. In the delicate balance between 
few nearly exact Newton steps with expensive inner linear solutions and many inexact 
Newton steps with bounded-cost inner linear solutions, we find the bottom line of overall 
execution time best served by bounding the inner linear work. This approach is also found 
most effective in the context of inviscid aerodynamics based on the primitive variable Euler 
equations in [7]. It deprives Newton’s method of its asymptotic quadratic convergence, but 
provides steep linear convergence. 

4.3 Two-level overlapping Schwarz preconditioners with non-nested coarse 
spaces 

In this subsection, we discuss a two-level overlapping Schwarz preconditioner with inexact 
subdomain solvers and non-nested coarse grid. Let ft be the domain of the full potential 
equation. We first partition the domain into nonoverlapping substructures ft,, i = 1, • • • , N. 
To obtain an overlapping decomposition of the domain, we extend each subregion ft, to a 
larger region ft’, i.e., ft, C ft’. Only simple box decomposition is considered in this paper: 
all the subdomains ft; and ft’ are rectangular and are made up of integral numbers of fine 
mesh cells. For simplicity, we also assume that all substructures are of the same size. More 
precisely, the size of ft,, i = 1, • • •, N, is H x x H y and the size of ft’ is H x x H' y , where the 

4 We believe, after Saad (personal communication), that this may be due to a lack of floating point 
commutativity in the product that expresses z m in GMRES, namely z m = PVmV , where V m is a Gram- 
Schmidt basis for /C m and y is a coefficient vector of dimension m that satisfies a related least squares 
problem (see [ 36 ]). The effect seems related to drastic variations in the magnitude of successive elements of 

y- 
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H' are chosen so as to ensure a discrete overlap, denoted by ovlp , which is uniform in the 
number of fine grid cells all around the perimeter, i.e., 

ovlp = (H' x -H x )/2 = {H' y -H y )/2 , 

for interior subdomains. For boundary subdomains, we simply cut off the part that is 
outside ft. Fig. 3, which appears later with the definition of numerical boundary conditions, 
illustrates a decomposition with an overlap of three fine mesh cells. 

On each extended subdomain ft,-, we construct a so-called subdomain preconditioner 
B{ = {Jij}, where the node indexed by (i,j) belongs to the interior of ft’. J tj is calculated 
by using the formula (11). The density upwinding discussed earlier is used in the transonic 
cases. Homogeneous Dirichlet BCs are used on the internal subdomain boundary dft, fl ft, 
and the appropriate external boundary condition is used on the physical boundary if present. 

We next discuss the construction of the coarse grid and the coarse grid preconditioner. 
The coarse grid is built independently of the fine mesh. We cover ft with another uniform 
rectangular mesh ft// = { t % h , i = 1, . . ., Mjj}, and at each coarse node we introduce a 
bilinear finite element basis function The set of coarse nodes is not generally a 

subset of the fine mesh nodes. In other words, the discrete subspaces defined by the two 
meshes are generally non-nested [4]. Both coarse and fine grids cover the entire ft, and 
they share the same boundary, which they both resolve exactly because of its prescribed 
simplicity. (The case of a multi-level Schwarz preconditioner for geometrically complex 
grids, in which only the finest level exactly resolves the boundary geometry, is considered 
in [10].) The coarse grid preconditioning matrix Bq is defined by using formula (11) with 
respect to the basis functions {$,}. The coarse grid matrix arises from an independent 
discretization, not an agglomeration of fine grid matrix. No upwinding is used on the 
coarse grid even in the transonic case. Empirically, the convergence may be slowed down 
if the density upwinding is used at the coarse grid, since a poorly located shock may be 
“resolved” and added to the fine grid solution. We do not fully understand the reason for 
this slowdown, and believe we are not alone in regarding the choice of a coarse grid operator 
for mixed elliptic-hyperbolic problems as one of the most important outstanding questions 
in multilevel preconditioning. 

The interaction of the coarse and the subdomain preconditioners is through the inter- 
polation and restriction operations. We define the coarse- to-fine interpolation matrix, l\\ , 
as follows. Let Ifj = {l tJ } be an Mh x Mfj matrix, and 

h,j = 

where x,- is ith fine mesh node. The fine-to-coarse restriction matrix is defined as (/^) T , 
the transpose of Iff. The additive Schwarz preconditioner can be written as 

B~ l = I h H B»\l h H ) T + I x Bi\h) T + ■■■ + I N BJ{I N ) T . (17) 

Let n i be the total number of nodes in ft), then I{ is an Mh x n) extension matrix that 
extends each vector defined on ftj to a vector defined on the entire fine mesh by padding 
an n’ x n i identity matrix with zero rows. 

Various inexact additive Schwarz preconditioners can be constructed by replacing the 
matrices B x , i > 0, in (17) with convenient and inexpensive to compute matrices, such as 
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those obtained by using local incomplete factorizations. The coarse grid operator Bq 1 is 
always applied exactly. Some detailed comparisons of (17) with global ILU preconditioners 
on rather general scalar problems can be found in [5]. Experience with transonic potential 
problems in the Boeing TRANAIR code can be found in [40]. 

5 Parallel implementation issues 

We implemented the family of NKS algorithms on the IBM SP2. The top-level message- 
passing calls are implemented through the Chameleon Package of Gropp and Smith [19], 
which uses the IBM MPL library. 

The code is written in a hostless manner. Each processor is assigned one subdomain, 
and the information pertaining to the interior of the subdomain is uniquely owned by that 
processor and is not available to any other processors except by message passing. Following 
the parallel complexity study in [18], the low-storage coarse mesh information is duplicated 
in each of the processors. On each processor, we store the subvectors and subblocks of 
the Jacobian matrix associated with an extended subdomain. For the coarse-grid precondi- 
tioner, the right-hand vector is built by a parallel fine-to-coarse restriction operation. Once 
the right-hand vector is obtained, the coarse linear system is solved simultaneously on all 
of the processors. The solution is then added to the local subdomain solutions by using a 
parallel coarse- to-fine interpolation operation. In all the experiments that we have done, 
the size of the coarse linear system is so small that the CPU time spent on it is negligible. 
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Figure 2: Illustration of two-way buffer copies required at each nearest- neighbor boundary. 
For each action of the Schwarz preconditioner on a vector the data needed in the extended 
regions are copied from the interior of neighboring subdomains. The amount of data moved 
for each processor is proportional to the area of overlap. 


The multiplication of a vector with the Schwarz preconditioner is the most expensive 
operation in terms of memory consumption and execution time. At the beginning of each 
nonlinear iteration, the ^-dependent local and coarse grid preconditioning matrices are 
computed explicitly, and stored in Compressed Sparse Row (CSR) format. According to 
the desired type of local solver (see below), the matrices are factored, and the upper and 
lower triangular parts stored. The matrices for the interpolation and restriction between 
the coarse and fine meshes are independent of 4>, and are calculated in a preprocessing 
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step. After the solution of each subproblem is obtained, those portions that lie within 
the overlapping regions (bounded by the dashed boxes in Fig. 2) are sent to neighboring 
subdomains to complete the summation defined in (17). The length of the message is 
proportional to the area of overlap. 



Figure 3: Domain 1 1 with an exaggerated NACA 0012 curve at the bottom. The dashed lines 
indicate the partition of the domain into nonoverlapping substructures, and the dotted lines 
indicate the overlapping subdomains. The incomplete fine mesh of solid lines illustrates an 
overlap of 3 subintervals. Eg is the inflow, Ts the freestream, and T 4 the outflow boundary. 


6 Numerical results 

In this section, we report some numerical results obtained on the IBM SP2 with up to 32 
processors for both subsonic and transonic flows. The SP2 offers subsets of dedicated nodes 
through a batch scheduler. Other jobs on different dedicated subsets share the communica- 
tion network, but processor allocation tends to concentrate intercommunicating processors 
onto independent subnetworks. We report five performance metrics for each run: (1) the 
total number of Newton iterations; (2) the total number of GMRES iterations; (3) the total 
execution time (including the pre-processing step such as the decomposition of the mesh, the 
calculation of message lengths and the allocation of sparse matrices, all communication and 
synchronization overhead, etc.), which is an average over all processors; (4) the megaflop 
rate, which is a sum of the rates on each processor; and (5) the total communication time, 
which is an average over all processors (isolated out of (3)). Metrics (1) and (2) are of 
interest in understanding convergence rates, while (3), (4) and (5) are useful in assessing 
bottom-line performance and modeling scalability. 
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The computer code was first developed on a network of workstations, and then moved 
to the IBM SP2, changing only a UNIX makefile. To obtain the best performance of the 
code, in terms of either the elapsed time or the megaflop rate, is not the main purpose of 
this paper. We provide the execution time and megaflop information for all the calculations 
for completeness. Though compiler optimization was used, the listed megaflop rates are an 
order of magnitude below their peak values. Greater attention to cacheing is undoubtedly 
required to improve this situation, and will potentially be simplified when addressed in the 
future by the domain-oriented structure of the software. 

6.1 Test problem and parameter selection 

ft is a unit-aspect ratio square partitioned into a uniform rectangular meshes up to 512 x 512 
in size. Let q 0 0 , the farfield flow speed, be normalized to 1. In Fig. 3, let $ CO — fx 7oo dx. 
We assume the following boundary conditions. 


• On the farfield boundaries T 4 , Ts and 1^, we assume $ = $oo. 

• On T 2 , 

*7J — -V^oo 

dy 

where n = (n x ,n y ) is the unit outward normal, and where y = f(x) describes the 
shape of airfoil for x G IV Once the function f(x) is given, this condition becomes 


d<P 

dy 


-««>/(*). 


• On Ti and 1^, we impose for symmetry the no penetration condition 

_ d$ _ 
dn dy 


The functional form used for the NACA0012 geometry [38] is 

f(x) = 0.17814(V£ - x) + 0.10128(^(1 - x)) - 0.10968x 2 ( 1 - x) + 0.06090x 3 (l - x), 

for x 6 (0, 1). This unit interval is scaled to (1/3, 2/3) in the overall domain. The blunt 
leading edge of the airfoil poses a technical problem for the transpiration boundary condi- 
tion, since /'(x) is undefined there, so we slightly modify the function /(x). The curve in 
the interval [0, 0.047059] is replaced by a parabola with a matching function value at x = 0, 
and matching function and first derivative values at x = 0.047059. 

A number of parameters need to be specified in the NKS algorithms. The selection of 
some parameters, such as the number of subdomains, is related to the granularity of the 
architecture, not to the equation, itself. Altogether, we have 


• Switching-function parameters, in the transonic case (§3.2). The level of maximization 
of the switching function is set to 2, i/q is 1.0, and the cutoff Mach value is M 2 = 0.95. 
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• Finite differencing parameter, e (§4.1). We find that for the nondimensional scalar 
full potential equation, the numerics are not very sensitive to e. We simply set it to 
10 -8 , near the square root of the machine epsilon. 

• Newton convergence parameters (§4.1). The initial guess is a simple interpolation of 
the farfield boundary condition. Nonlinear convergence is declared following a 10 _1 ° 
relative reduction of the initial residual. The step size reduction ratio in the line 
search is 0.5 and the termination tolerance is 10“ 4 . 

• Krylov convergence parameters (§4.2). The convergence tolerance for the linear iter- 
ative solver at each Newton iteration rjk = 10~ 2 ||F($*)||2 • We restart GMRES at 
every 30th iteration. 

• Number of subdomains, ns (§4.3). Since only the additive version of Schwarz is under 
consideration, we always set that the number of subdomains is the same as the number 
of processors, np, which varies from 8 (the minimum required to store the problem) to 
32 (the maximum available within power-of-two configurations). (In a multiplicative 
algorithm [37], we would set n to np times the number of colors.) 

• Overlapping size, ovlp (§4.3). In fact, there are two overlapping sizes, in x and y 
directions. In this paper, we assume the same number of fine mesh cells, ovlp = 
1, . . .,5, are extended in both directions. 

• Coarse grid size (§4.3). This varies from no coarse grid (0 x 0) to a coarse grid with 
a modest number of points in each subdomain (10 x 11). (The coarse grid cells are 
square, but asymmetry in the employment of Neumann boundary conditions in the x 
and y directions makes the total number of gridpoints off by one.) 

• Level of fill, k , in ILU (§4.3). According to our past experience with multilevel pre- 
conditioning [5] and similar experience on a industrial-grade transonic potential code 
[33], relatively modest fill-in is optimal for small subdomains. Intuitively, little is 
lost relative to the coupling already sacrificed at subdomain boundaries. However, 
as the local memory keeps increasing on powerful modern parallel computers, such 
as the IBM SP2, the size of the subdomain problems can be quite large. For large 
subdomain problems, low level of fill-in is no longer as effective, k varies from 0 to 5 
in our experiments, then jumps discontinuously to the full band in the case of exact 
subdomain solves. 

6*2 Observations — subsonic case 

Our first test case corresponds to a subsonic problem with Moo = 0.1. The linear systems 
that arise fall within the elliptic theory for Schwarz [37]. It takes 6 Newton iterations to 
reduce the initial nonlinear residual by a factor of 10” 10 . Because of the Krylov dimension 
cut-off. the convergence is linear; see the left panel in Fig. 4. The top portion of Table 1 
shows the convergence performance for a fixed-size problem of 512 x 512 uniform cells with an 
increasing number of subdomains: 8, 16 and 32. The overlap size is fixed at 3 h. The density 
of the unnested uniform coarse grid varies from 0 x 0 to 10 x 11. Key observations from this 
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Figure 4: For M <*, = 0.1, the left figure shows the history of the Newton residual, and the 
right shows the (upper surface) C v curve at convergence. 

example are as follows: (1) Even a modest coarse grid makes a significant improvement in 
an additive Schwarz preconditioner, especially when the number of subdomains is large. As 
much as 40% of the execution time can be saved when adding a 2x3 coarse grid to a no coarse 
grid preconditioner, for the 32-subdomain case. (2) A law of diminishing returns sets in at 
roughly one point per subdomain. (3) When using 8 processors, the total communication 
time is always less than 5% of the total computational time, however, it becomes as much 
as 26% when using 32 processors. (This includes synchronization delays as well as the time 
actually delivering the message packets from application process to application process.) 
Table 2 shows the effects of the overlap size. For simplicity, we fix the coarse grid to 6 x 7 
for all test cases. The overlap size is given here in absolute terms, i.e., the distance between 
the boundary of the unextended subdomain and the extended subdomain, not relative to 
the diameter of the subdomain. All the subproblems are solved with the exact Gaussian 
elimination in sparse format. Since the fine mesh size is fixed, when using small number of 
processors, such as np = 8, the single processor memory requirement is substantial. In this 
case, increase the overlap size can indeed reduce the total number of GMRES iterations, 
but the reduction of the total execution time is rather limited. 

In Table 3, we present the results when the subproblems are solved with ILU(fc) for 
various levels of fill-in. The overlap size is 3 h, and the coarse grid is 7 x 8. The conclusion 
from the tests shown is that the larger the k , the faster the method becomes; see the boxed 
numbers in Table 3. When using a small number of processors, like 8, the best execution 
time is obtained with ILU(5); compare the upper portions of Tables 1, 2 and 3, However, if 
the processor number is large, the optimal result can only be obtained by considering several 
parameters: ovlp, k, the coarse mesh size, and perhaps others. We have not simultaneously 
varied all relevant parameters to get the best results, but have presented controlled slices 
through parameter space for insight. 

Load balancing should not be a significant issue in the dedicated-processor subsonic case. 
All processors have nearly the same computational load, except those which have to handle 
the Neumann boundary conditions. This is no longer true for the transonic calculation, 
when a shock resides in some of the subdomains. See §6.4. 
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Figure 5: For = 0.8, the left figure shows the history of the Newton residual, and the 
right shows the (upper surface) C p curve at convergence. 
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Figure 6: The left figure shows the convergence history of the C v curves at = 0.8. The 
right figure shows the Mach contours of the final solution at M 0 0 — 0.8. 

6.3 Observations — transonic case 

Figure 5 shows the convergence history and converged C v for a transonic problem with 
Afoo = 0.8. The first and most important observation is that without a proper upwinding 
discretization, all three components of NKS can fail. Fig. 6 shows the convergence history 
in terms of the C p curves. We note that it takes only 4 to 5 iterations for the Newton's 
method to establish the neighborhood of the shock, but another 15 or so iterations to move 
it to the exact location. Mach contours at the final solution are given in Fig. 6. While the 
shock is setting up, the linear convergence of Newton’s method is interrupted; see the left 
panel of Fig. 5. 

The results for coarse grids of varying size are summarized at the bottom part of Table 1. 
The columns marked 0x0 and 2x3 reveal an interesting result for a mixed elliptic- hyperbolic 
problem. The inclusion of a small coarse grid can reduce the total number of the linear 
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Figure 7: The GMRES convergence history for the entire nonlinear iteration. The left figure 
does not have a coarse space in the Schwarz preconditioner, the right figure contains a 7 x 8 
coarse space. The number at the tail of each curve corresponds to the Newton iteration 
number. 

iterations, as well as the total execution time, by a factor of 30%. An optimally chosen 
coarse grid size can lead to a greater savings. In Fig. 7, we overlay the convergence histories 
of all the linear solutions in a complete nonlinear calculation. The history in the left panel 
is without a coarse grid, and that in the right with a 7 x 8 coarse grid. The corresponding 
execution time requirements can be found in Table 1. 

The number of linear iterations and the total execution time can be reduced even further 
if a proper overlap size, which is not usually very small, is used; see Table 2. 

The best result, in terms of the total execution time, among all the test calculations is 
obtained using a ILU(fc), with k = 5, as the subproblems solver; see Table 3. It takes less 
than *l\ minutes on the 32-processor IBM SP2 to set up and solve the Mach 0.8 nonlinear 
system with more than a quarter of a million unknowns. 

6*4 Parallel efficiency 

The parallel efficiency of the present algorithm-software-hardware system is encouraging, 
but it is useful to sort out in detail where efficiency is lost in going from 8 to 32 processors. 
We display the parallel performance in Table 4, whose first three columns are excerpted 
from the first and last columns of Table 3. The execution time data in the last column 
of Table 3 is the best, or nearly the best, for each Mach number and parallel granularity 
out of all of the parameter combinations considered, and is therefore the most meaningful 
from which to draw parallel efficiency conclusions, though more flattering conclusions could 
be drawn from runs that were performing more computation per node per communication 
exchange. 

After the number of processors, we list the number of linear GMRES iterations per 6 
Newton steps in the upper (subsonic) half of the table, and per 19 Newton steps in the 
lower (transonic) half of the table. Then we list the execution time, per 6 or 19 Newton 
steps, respectively, these being the typical number of Newton steps required to fully solve 
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the nonlinear problem. Fig. 7 shows that, for the transonic case, the number of GMRES 
steps can vary significantly over the course of a complete set of Newton iterations, but the 
mean and the median are close. 

Consider the following idealized model for the execution time of the fixed-size problem 
on p processors. Let T(p) be the overall execution time, I(p) the number of linear iterations, 
and C(p) the average cost per iteration. (We note that in GMRES, the average cost per 
iteration is not independent of the number of iterations, because of the orthogonalization 
overhead of later Krylov vectors is greater than earlier vectors, but we assume that the 
dominant cost per iteration is the parallel Schwarz preconditioning.) 

The overall parallel efficiency is defined as p(p) = p ^ - ^ , where T(p) = I(p) ■ C{p). 
Since we lack results for p = 1 on this problem of industrial size (512 x 512), we replace 
all efficiencies by relative efficiencies with respect to the minimum configuration of p\ = 8. 
The overall relative parallel efficiency is therefore defined as r}{p\ —* p) = ■ . The 

numerical efficiency, a measure of the robustness of the preconditioning with respect to 
increasing granularity, is p„ umer (pi -> p) = The implementation efficiency is the 

remaining factor, PimpliPi — p ) = , so that rj(pi -*• p) = rj numer (p) x p i mp i(p). 

The numerical efficiency is nearly 90% or above for all cases — that is, the convergence 
rate of the preconditioned linear system hardly degrades with increasing parallel granu- 
larity. Approximately 13 GMRES steps are required for each subsonic Jacobian system 
and approximately 23 GMRES steps for each transonic Jacobian system. This insensitivity 
to granularity for a multilevel preconditioned operator is predicted by the Schwarz theory 
for the subsonic case, and seems to be a fortunate consequence of the relatively confined 
supersonic pocket of flow in the transonic case. 

The implementation efficiency accounts for the most significant factor of overall efficiency 
decline. The difference between the subsonic and transonic implementation efficiencies at 
high granularity can be attributed to load imbalance, since the cells requiring upwinding are 
concentrated into a small number of processors. (A more sophisticated dynamic mapping 
algorithm could address this problem, but this is beyond present scope.) The subsonic 
degradation of 76% in going from 4 to 32 nodes is identified as the chief remaining loss. 
Redundant work and higher communication-to-computation ratio in the overlap regions, 
which account for a steadily increasing fraction of all points in a fixed-size problem explain 
the majority of this efficiency loss, which would disappear in a scaled problem with fixed-size 
subdomains on each processor. 

6.5 Sequential comparison with global ILU(fc) preconditioners 

The results of this section establish Schwarz preconditioning as numerically attractive and 
reasonably parallel efficient, but it is natural to ask whether its utility is limited to dis- 
tributed memory implementations of Newton-Krylov methods. To satisfy curiosity on this 
point, we conclude with tests of Schwarz preconditioning against the popular global 1 L U [k). 
k = 0, . . .5, family of preconditioners on a non-dedicated single-processor SUN SPARCsta- 
tion with 512MB of memory. The results are summarized in Table 5. Because of the overlap 
and the coarse solve, the Schwarz preconditioner needs more memory, even if all subdomain 
problems are solved inexactly with ILU(5), than the other global ILU(Ar) preconditioners. 
On the other hand, Schwarz outperforms all the global solvers in terms of total GMRES 


17 



iteration count and the total execution time. Part of the reason for the fine performance of 
the Schwarz method is the much higher uni-processor Megaflop rating, which is presumably 
related to much improved cache locality. 

7 Conclusions 

We have investigated computationally the effectiveness of Newton-Krylov-Schwarz methods 
applied to the full potential equation of aerodynamics in some simplified situations in two 
space dimensions. Best performance is obtained with modest overlap, a modest coarse grid 
(one or two points per processor), modest-to-generous fill in the subdomain ILU precon- 
ditioners, and uniformly loose convergence tolerances on the Krylov iterations within each 
Newton step. For subsonic problems, the theoretically expected performance of the method 
is essentially achieved. For the transonic case, the numerics are more encouraging than ex- 
isting theory. Overall computation time is approximately six times greater for the transonic 
than for the subsonic case, with current upwinding strategies. This can be factored into 
a three-fold increase in the number of Newton steps in the transonic case, and a two-fold 
increase in the number of Krylov iterations per Newton step. 

Two strategies that should be employed on more nonlinearly taxing problems that we 
have not considered here are mesh sequencing and pseudo-transient continuation. Their 
purpose is to deliver an initial iterate for the steady-state form of Newton’s method employed 
in this paper that is already in the local domain of convergence on the finest grid. (Observe, 
for instance, that the number of Newton steps required on = 0.8 problem the 256 x 256 
grid in Table 5, is roughly half that of the corresponding problem on the 512 x 512 grid in 
Table 3. If the shock is correctly located on a (relatively) coarse grid, the plateau of Fig. 5 
will be diminished on a finer grid that is initialized from the coarse grid solution.) Our rapid 
turnaround times for two-dimensional problems artificially deemphasize the importance 
of these strategies in large, complex nonlinear problems. In addition to globalizing the 
Newton convergence, continuation strategies tend to improve the linear conditioning of the 
intermediate problems, and are therefore potentially useful even in problems (such as ours) 
for which simple initial guesses on the finest grid lead to convergence. 

The broadest motivation for Newton-Krylov-Schwarz methods is the need to solve large- 
scale problems with complex discretizations on distributed-memory systems with limited 
memory per node. The matrix- free aspect of Newton permits shortcuts in Jacobian for- 
mation storage while the domain decomposition aspect of Schwarz leads to load-balanced 
data-to-memory maps that render communication subdominant in the preconditioning. The 
amount of work done in the Krylov iteration can be adjusted to produce an overall method 
with the best balance between the nested components. 
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Table 1: Varying the coarse grid size. Fine mesh 512 x 512, M ^ = 0.1 and 0.8, sparse 
LU for all subproblems, ovlp = 3 h. “Newton” is the total number of Newton iterations. 
“GMRES” is the total number of GMRES iterations occur in all of the Newton iterations. 
“EXEC” is the execution time per processor in seconds for the entire calculation. “COMM” 
is the total communication time per processor in seconds. 


Tip 

Coarse Grid 

0x0 2x3 4x5 

6x7 


8x9 

10 x 11 

M 0 o = 0.1 


Newton 

6 6 6 

6 


6 

6 


GMRES 

144 81 59 

53 


50 

51 

8 

EXEC 

136.79 125.30 104.18 

97.28 


94.18 

94.63 


COMM 

2.14 2.10 1.95 

2.39 


3.28 

4.24 


Mflop/s 

183.59 157.02 170.67 

177.63 


180.94 

180.93 


Newton 

6 6 6 

6 


6 

6 


GMRES 

167 92 66 

54 


54 

54 

16 

EXEC 

72.18 50.34 42.10 

39.18 


40.95 

42.32 


COMM 

2.25 2.38 2.37 

2.88 


397 

5.48 


Mflop/s 

295.19 266.83 277.39 

280.01 


272.65 

265.44 


Newton 

6 6 6 

6 


6 

6 


GMRES 

227 105 72 

64 


53 

57 

32 

EXEC 

47.94 29.55 24.32 

24.46 


23.43 

26.59 


COMM 

3.36 2.92 2.95 

4.13 


4.51 

7.00 


Mflop/s 

498.04 477.99 463.72 

440.64 


423.16 

390.33 

OO 

O 

II 

S’ 


Newton 

20 19 19 

20 


19 

19 


GMRES 

814 435 359 

350 


307 

311 

8 

EXEC 

757.45 548.52 490.64 

498.08 


458 75 

466.20 


COMM 

10.97 11.63 12.16 

15.68 


19.07 

24.52 


Mflop/s 

158.70 158.47 162.20 

163.44 


163.00 

161.15 


Newton 

' 19 19 20 

20 


19 

19 


GMRES 

849 483 398 

349 


320 

330 

16 

EXEC 

432.78 327.77 306.25 

291.14 

□ 

277.06 

289.89 


COMM 

11.69 12.90 14.14 

18.43 


23.26 

32.79 


Mflop/s 

233.50 224.48 219.17 

215.38 


210.78 

204.68 


Newton 

19 19 19 

19 

20 

20 


GMRES 

1142 636 454 

387 

391 

385 

32 

EXEC 

333.79 244.36 207.54 

200.21 

215.46 

227.84 


COMM 

16.09 18.43 19.40 

25.26 

34.46 

47.68 


Mflop/s 

360.18 333.53 317.77 

298.29 

284.96 

267.36 
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Table 2: Varying the overlapping size ovlp. Fine mesh 512 x 512, M 0 0 = 0.1 and 0.8. 
Exact LU for all subproblems. Coarse grid size 6x7. “Newton” is the total number of 
Newton iterations. “GMRES” is the total number of GMRES iterations occur in all of the 
Newton iterations. “EXEC” is the execution time per processor in seconds for the entire 
calculation. “COMM” is the total communication time per processor in seconds. 


np 

| 

ovlp = Ih 

ovlp — 2 h 

ovlp — 

II 

o 

CO 

ovlp = 

5 h 





= 0.1 





6 

6 

6 

6 

6 


GMRES 

77 

60 

53 

50 

50 

8 

EXEC 

101.12 

99.90 

97:31 

98.31 

100.50 


COMM 

3.22 

2.59 

2.46 

2.36 

2.59 



161.03 

174.33 

177.61 

186.71 



Newton 

6 

6 

6 

6 

6 


GMRES 

82 

61 

54 

49 

49 

16 

EXEC 

51.77 

45.46 

44.05 

44.42 

45.21 


COMM 

3.87 

3.07 

2.95 

2.77 

2.64 


Mflop/s 

267.85 

276.36 

279.31 

277.96" 

284.54 


Newton 

6 

6 

6 

6 

6 


GMRES 

93 

72 

64 

57 

52 

32 

EXEC 

30.35 

28.28 

26.75 

26.31 

1 26.19) 




5.42 

4.24 

3.64 

3.58 

3.19 


Mflop/s 

387.34 

421.17 

441.80 

452.00 

460.17 




M oc 

= 0.8 




Newton 

19 

19 

20 

20 

20 


GMRES 

435 

365 

350 

319 

315 

8 

EXEC 

527.50 

489.65 

498.78 

491.47 

492.52 


COMM 

17.66 

15.87 

16.31 

15.86 

16.49 


Mflop/s 

145.19 

160.11 

163.11 

170.98 

172.56 



19 

19 

20 

20 

19 


GMRES 

462 

377 

349 

324 

303 

16 

EXEC 

315.39 

294.34 

292.02 

280.61 

278.21 



COMM 

22.74 

19.22 

18.70 

17.67 

17.10 


Mflop/s 

216.43 

211.85 

196.48 

221.34 

219.24 


'mmsM 

19 

19 

19 

19 

19 


GMRES 

627 

445 

387 

354 

344 

32 

EXEC 

251.96 

207.12 

199.90 

188.99 

189.44 


COMM 

38.08 

27.51 

24.12 

23.02 

22.31 


Mflop/s 

276.02 

301.20 

298.73 

313.36 

320.93 
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Table 3: Varying the level of ILU(£) fill-in. Fine mesh 512 x 512, = 0.1 and 

0.8. Coarse grid is 7 x 8. ovlp = 3 h. “Newton” is the total number of Newton iterations. 
“GMRES” is the total number of GMRES iterations occur in all of the Newton iterations. 
“EXEC” is the execution time per processor in seconds for the entire calculation. “COMM” 
is the total communication time per processor in seconds. 


np 

ILU(Jb) 

o 

II 

-s* 

k = 1 

CM 

11 

• 4 * 

k = 3 

k = 4 

k = 5 




Moo 

= 0.1 





Newton 

6 

6 

6 

6 

6 

6 


GMRES 

307 

175 

127 

98 

84 

75 

8 

EXEC 

171.46 

110.07 

88.26 

76.11 

74.28 

71.74 


COMM 

14.08 

8.10 

5.87 

4.54 

4.44 

3.62 


Mflop/s 

119.80 

113.63 

109.46 

102.36 

96.00 

95.43 


Newton 

6 

6 

6 

6 

6 

6 


GMRES 

299 

178 

129 

101 

87 

78 

16 

EXEC 

97.82 

64.14 

51.19 

43.34 

40.81 

40.07 



COMM 

16.73 

9.42 

6.85 

5.38 

4.80 

3.88 


Mflop/s 

208.04 

201.60 

195.17 

188.34 

183.56 

179.69 


Newton 

6 

6 

6 

6 

6 

6 


GMRES 

298 

179 

130 

104 

90 

82 

32 

EXEC 

64.34 

41.92 

32.73 

28.18 

26.34 

25.57 


COMM 

19.26 

11.44 

8.14 

6.63 

5.70 

4.91 


Mflop/s 

324.76 

321.72 

318.07 

308.78 

304.06 

305.19 




Moo 

= 0.8 





Newton 

19 

20 

18 

19 

19 

19 


GMRES 

1638 

911 

622 

547 

462 

424 

8 

EXEC 

863.15 

523.33 

382.27 

364.11 

335.47 

330.76 

r 


COMM 

77.30 

42.64 

29.13 

25.90 

21.62 

19.81 


Mflop/s 

127.49 

125.37 

124.04 

123.70 

122.13 

119.90 


Newton 

19 

20 

20 

19 

19 

19 


GMRES 

1693 

924 

707 

531 

460 423 

16 

EXEC 

526.40 

316.98 

255.54 

209.27 

195.96 

192.16 


COMM 

90.88 

49.9 

38.32 

28.55 

25.14 22.46 


Mflop/s 

220.22 

213.73 

215.42 

214.21 

210.64 208.59 


Newton 

19 

20 

20 

19 

20 20 


GMRES 

1746 

961 

799 

596 

557 501 

32 

EXEC 

361.81 

217.29 

187.37 

150.59 

150.29 

145.12 



COMM 

110.89 

61.46 

51.32 

38.02 

35.60 32.07 


Mflop/s 

342.40 

336.48 

342.84 

341.52 

343.77 340.06 
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Table 4: Parallel efficiency. Fine mesh 512 x 512, = 0.1 and 0.8. Coarse grid is 

7x8. ovlp = 3 h. “GMRES” and “EXEC” are the total number of GMRES iterations and 
seconds of execution time per 6 Newton steps in the upper half of the table, and per 19 
Newton steps in the lower half. 


Tip 

GMRES EXEC 

Vnumer 

T]impl 

V 



M 00 = 

= 0.1 



8 

75 

71.1 

- 

- 

~ 

16 

78 

40.1 

0.961 

0.931 

0.895 

32 

82 

25.6 

0.915 

0.766 

0.701 



Moo = 

= 0.8 



8 

424 

330.1 

- 

- 

- 

16 

423 

192.2 

1.002 

0.860 

0.861 

32 

475 

137.9 

0.891 

0.674 

0.601 


Table 5: Sequential comparison of the additive Schwarz preconditioner(OSM) with the 
global ILU(fc), k = 0, . . . , 5, preconditioners on a single processor Sun workstation. The fine 
mesh is 256 x 256. The specifications of OSM are: 8 subdomains, 3 h overlap, 7x8 coarse 
grid, and ILU(5) as the subdomain solver. MEM is the total memory needed to store the 
preconditioning matrix in Megabytes. 



OSM 

ILU(0) 

ILU(l) 

ILU(2) 

ILU(3) 

ILU(4) 

ILU(5) 




Moo = 

0.1 




Newton 

6 

6 

6 

6 

6 

6 

6 

GMRES 

48 

509 

280 

195 

148 

119 

103 

EXEC 

n 

496.29 

2754.54 

1599.84 

1178.55 

970.72 

817.42 

771.04 

MEM(MB) 

24.22 

10.58 

10.57 

13.65 

16.83 

20.12 

22.87 

M flop/s 

32.12 

6.68 

6.48 

6.19 

6.04 

5.85 

5.40 




Moo = 

0.8 




Newton 

11 

12 

12 

11 

11 

11 

11 

GMRES 

136 

1391 

666 

464 

328 

268 

217 

EXEC 

n 

1240.18 

7766.42 

3826.05 

2848.73 

2278.81 

1761.89 

1526.82 

MEM(MB) 

24.40 

10.64 

10.63 

13.75 

16.95 

20.26 

23.06 

Mflop/s 

36.57 

6.53 

6.48 

6.20 

5.63 

5.99 

5.79 
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